Published on

窗口函数 percent_rank() 和 rank() 因 order by 顺序不一致影响计算结果

Authors
  • avatar
    Name
    Wang Zhiwei
    Twitter

查阅文档可知

percent_rank = (r-1)/(n-1)
r = rank
n = partition_count

在计算过程中发现 precent_rankrank 函数的结果不匹配,如以下 case 1 中所述:

rank19800
partition_count26774
(r-1)/(n-1)0.7395
1-(r-1)/(n-1)0.2604
percent_rank0.1235

直觉上 percent_rank 结果应该等于 1-(r-1)/(n-1) ,但是计算结果并不相等

又进行 case 2 尝试,得到计算结果:

rank19800
partition_count26774
(r-1)/(n-1)0.7395
1-(r-1)/(n-1)0.2604
percent_rank0.7395

从而推断出,因为计算 rankpercent_rankorder by 顺序不一致会导致结果出现差异,正确的是应该保持两个函数按相同的顺序计算。

查阅 rank 函数的定义,发现使用 rank 计算如果存在并列名次会进行跳跃排名,产生间隙,可以肯定是因为这个排名间隙导致升序和降序排名时 percent_rank 结果会不一致。

rank() → bigint** **Returns the rank of a value in a group of values. The rank is one plus the number of rows preceding the row that are not peer with the row. Thus, tie values in the ordering will produce gaps in the sequence. The ranking is performed for each window partition.


case 1:

rankpercent_rank计算结果
descasc不一致
WITH company AS (
  SELECT
    company_id,
    element_at(seic, 1).seic_lv2_code AS seic_lv2_code,
    score_dimension.tech_layout AS tech_layout
  FROM
    npd_dw.innovation_company
  WHERE
    status = 'ACTIVE'
),
scores_ranking AS (
  SELECT
    company_id,
    seic_lv2_code,
    -- 技术布局评分
    tech_layout,
    RANK() OVER (
      PARTITION BY seic_lv2_code
      ORDER BY
        tech_layout DESC
    ) AS tech_layout_rank,
    PERCENT_RANK() OVER (
      PARTITION BY seic_lv2_code
      ORDER BY
        tech_layout
    ) AS tech_layout_percent_rank,
    COUNT(company_id) OVER (PARTITION BY seic_lv2_code) AS seic_company_cnt
  FROM
    company
)
SELECT
  company_id,
  seic_lv2_code,
  tech_layout,
  tech_layout_rank,
  tech_layout_percent_rank,
  seic_company_cnt,
  (CAST(tech_layout_rank AS DOUBLE) -1) /(seic_company_cnt -1) AS tech_layout_percent_rank_desc
FROM
  scores_ranking
WHERE
  company_id = '81F08062408E4DB9D16D71B0094F74D7'

Untitled.png


case 2:

rankpercent_rank计算结果
descdesc一致
WITH company AS (
  SELECT
    company_id,
    element_at(seic, 1).seic_lv2_code AS seic_lv2_code,
    score_dimension.tech_layout AS tech_layout
  FROM
    npd_dw.innovation_company
  WHERE
    status = 'ACTIVE'
),
scores_ranking AS (
  SELECT
    company_id,
    seic_lv2_code,
    -- 技术布局评分
    tech_layout,
    RANK() OVER (
      PARTITION BY seic_lv2_code
      ORDER BY
        tech_layout DESC
    ) AS tech_layout_rank,
    PERCENT_RANK() OVER (
      PARTITION BY seic_lv2_code
      ORDER BY
        tech_layout DESC
    ) AS tech_layout_percent_rank,
    COUNT(company_id) OVER (PARTITION BY seic_lv2_code) AS seic_company_cnt
  FROM
    company
)
SELECT
  company_id,
  seic_lv2_code,
  tech_layout,
  tech_layout_rank,
  tech_layout_percent_rank,
  seic_company_cnt,
  (CAST(tech_layout_rank AS DOUBLE) -1) /(seic_company_cnt -1) AS tech_layout_percent_rank_desc
FROM
  scores_ranking
WHERE
  company_id = '81F08062408E4DB9D16D71B0094F74D7'

Untitled.png